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Abstract 

Cauliflower mosaic virus (CaMV) is a plant pararetrovirus with a double-stranded DNA genome. It is the type member of the 
genus Caulimovirus in the family Caulimoviridae. CaMV is transmitted by sap inoculation and in nature by aphids in a semi- 
persistent manner. To investigate the patterns and timescale of CaMV migration and evolution, we sequenced and analyzed 
the genomes of 67 isolates of CalVlV collected mostly in Greece, Iran, Turkey, and Japan together with nine published 
sequences. We identified the open-reading frames (ORFs) in the genomes and inferred their phylogeny. After removing 
recombinant sequences, we estimated the substitution rates, divergence times, and phylogeographic patterns of the virus 
populations. We found that recombination has been a common feature of CalVlV evolution, and that ORFs l-V have a 
different evolutionary history from ORF VI. The ORFs have evolved at rates between 1.71 and 5.81 xlO"'* substitutions/site/ 
year, similar to those of viruses with RNA or ssDNA genomes. We found four geographically confined lineages. CaMV 
probably spread from a single population to other parts of the world around 400-500 years ago, and is now widely 
distributed among Eurasian countries. Our results revealed evidence of frequent gene flow between populations in Turkey 
and those of its neighboring countries, with similar patterns observed for Japan and the USA. Our study represents the first 
report on the spatial and temporal spread of a plant pararetrovirus. 
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introduction 

Studies of the population genetics of plant viruses are important 
for understanding the evolution of virus-host interactions [1-3], 
because plant viruses sometimes adapt rapidly to new or resistant 
hosts [4-6]. Most evolutionary studies of plant viruses have 
focused on those with single-stranded RJMA (ssRNA) genomes [3] , 
[7-9], partly because many plant viruses have such genomes. 
Another reason for this focus is that they have error-prone RNA 
polymerases, and therefore evolve at a measurable rate which 
complicates the creation of resistant plant cultivars. Populations of 
plant viruses with single-stranded DNA (ssDNA) genomes have 
also been studied, including those of begomoviruses and 
mastreviruses in the family Geminiviridae, which also evolve at a 
measurable rate, are emergent viruses and damage many crops 
worldwide [10-15]. These reports showed that virus populations 
have been shaped by selection, founder effects, and recombina- 
tion. On the other hand, there has been littie work on the 
population genetics of plant viruses with double-stranded DNA 
(dsDNA) genomes. 

Cauliflower mosaic virus (CaMV) has a dsDNA genome and is the 
type species of the genus Caulimovirus in the family Caulimoviridae 



[16]. Although it infects plants, CaMV is grouped with the 
hepadnaviruses of animals as a pararetrovirus because it has 
icosahedral virions and because its replication strategy involves an 
RNA intermediate [16]. CaMV is transmitted by sap inoculation, 
and in nature by aphids such as Brevicoryne brassicae, Myzus persicae, 
and at least 25 other species in a semi-persistent manner. CaMV 
reduces the yield and quality of brassica crops worldwide. In 
nature, its host range seems to be limited to plants of the family 
Brassicaceae, but some isolates are able to infect plants of the 
family Solanaceae experimentally [17]. 

The genome of CaMV is a circular dsDNA molecule of about 
8000 nt with three short single-stranded regions: two in one 
strand, one in the other [18]. It has seven open reading frames 
(ORFs) and large and small intergenic regions [16]. Located 
between ORF VI and ORF I, the large intergenic region contains 
the pregenomic RNA 35S promoter, the RNA polyadenylation 
signal, and the minus-strand primer-binding site. The small 
intergenic region, containing the 1 9S promoter, is located between 
ORFs V and VI. The genome encodes six viral gene products that 
have been detected in planta. Protein PI is the cell-to-cell 
movement protein, P2 is the aphid transmission factor, P3 is the 
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virion-associated aphid transmission factor, P4 is the coat protein 
precursor, P5 is the polyprotein precursor of proteinase, reverse 
transcriptase, and ribonuclease, and P6 is the translation 
transactivator/viral silencing suppressor and also the major 
protein of the inclusion body matrix [19]. No protein encoded 
by ORF VII has been detected in pknta and the function of this 
ORF is still unknown [20]. 

The possibility of controlling a pathogen is improved if we know 
when, where, and how it first became established in the host 
population of interest, namely its 'centre of emergence'. This is 
analogous to the 'centre of diversity' of crop species [21]. Brassicas 
are major vegetable crops for human and animal consumption 
worldwide; most cultivars originated in South-West Eurasian 
countries [22-26]. Many plant viruses infect these crops, with 
Turnip mosaic vims (TuMV), Cucumber mosaic virus (CMV), and 
CaMV being particularly well known. We assessed the population 
structure of TuMV in a number of pre\ i()us studies [9], [27-29]. 

Only nine full nucleotide sequencx's of CaMV genomes have 
been reported so far [30], providing insufficient data to 
characterize the population structure of the virus. Here, we report 
the genomit: sequences of CaMV isolates from brassica hosts in the 
Eurasian region. We analyzed our 69 new sequences in 
combination with nine published sequences to estimate the 
phylogeny, the evolutionary timescale, and the degree of 
divergence between populations in different countries. Our 
analyses provide insights into the spatial and temporal evolution 
of several CaMV populations. 

Materials and Methods 

Virus isolates and host tests 

We surveyed the brassica crop-producing areas of Greece, Iran, 
Turkey, and Japan during the growing seasons of 2001-2010. AU 
collected samples were tested by direct double-antibody sandwich 
enzyme-linked immunosorbent assay (DAS-ELISA) [31] using the 
antiserum to CaMV (BIOREBA, Switzerland). Some of the 
Japanese isolates were gifts from NIAS Genebank, Japan, whereas 
the remaining isolates were collected from private gardens and 
fields, with permission from owners. No specific permissions were 
required for the locations/activities. Our field studies did not 
involve endangered or protected species. Details of the CaMV 
isolates, their place of origin, original or common (English) host 
plant, year of isolation, and host type are shown in Table SI in File 
SI. 

All of the isolates were sap-inoculated to the Japanese Brassica 
rapa cv. Hakatasuwari plants and serially cloned through single 
lesions at least three times using chlorotic local lesions that 
appeared approximately 10 days after the inoculation. The 
biological cloning step is important because CaMV was often in 
mixed infections with TuMA' and/ or CMV, and some plants 
contained a mixture of CaMV strains (data not shown). Hence, 
there is a possibilit)' that artificial recombination events will be 
detected in the sequence data. Purified CaMV isolates were 
propagated in Brassica rapa cv. Hakatasuwari plants. Plants infected 
systemically with each of the CaMV isolates were homogenized in 
0.01 M potassium phosphate buffer (pH 7.0) and mechanically 
inoculated onto young Japanese cultivars of B. rapa cv. 
Hakatasuwari, and Raphanus sativus cvs. Akimasari-2go, Taibyo- 
sobutori and Everest, B. oleracea var. capitata cvs. Shinsei, Ryosan 
2go and Soushu, B. okracea var. botiytis cv. Snow queen, B. oleracea 
var. italica cv. Challenger, B. oleracea var. Gongylodes cv. Grand duke, 
B. napus cv. Otsubu, B. pekinensis cv. Nozaki 1 -go, and B. campestris 
var. Narinosa cv. Tatsuai. Inoculated plants were kept for at least 
four weeks in a glasshouse at Saga University at 25 °C. 



Viral DNA and sequence data 

Viral DNAs were extracted from CaMV-infected B. rapa leaves 
using DNeasy Plant Mini Kit (Qiagen K.K., Japan). The DNAs 
were amplified using high-fidelity Platinum Pfx DNA polymerase 
(Invitrogen, Japan). The PGR products were separated by 
electrophoresis in agarose gels and purified using a QIAquick 
Gel Extraction kit (Qiagen K.K., Japan). Sequences from each 
isolate were determined using at least three overlapping indepen- 
dent PCR products to cover the complete genome. The serjuences 
of the PCR products or cloned fragments of adjacent regions of the 
genome overlapped by at least 300 nt to ensure that they were 
from the same genome and were not from different components of 
a genome mixture. Each PCR product was sequenced by primer 
walking in both directions using a BigDye Terminator v3.1 Cycle 
Sequencing Ready Reaction kit (Life Technologies, Japan) and an 
Applied Biosystems Genetic Analyzer DNA model 310. Ambig- 
uous nucleotides in any sequence were checked in sequences 
obtained from at least three to five other independent plasmids, 
which were cloned into iscoR V site of plasmid pZErO-2. 
Sequence data were assembled using BioEdit v5.0.9 [32]. 

The genomic sequences of the 76 isolates were used for a range 
of evolutionary analyses. The genomic sequence of ID 1 isolate of 
Horseradish latent virus (HRLV, Accession code NC_0 18858) was 
used as an outgroup because BLAST searches showed that it was 
most closely and consistently related to those of CaMV. We 
aligned all genes via the corresponding amino acid sequences using 
CLUSTAL X2 [33] witii TRANSALIGN (kindly supplied by 
Georg Weiller, Australian National University). ORF I to ORF V 
sequences were then reassembled to form concatenated ORF I-V 
sequences of 5,106 nt. We discarded overlapping sequences 
between ORF III and ORF IV (9 nt) and between ORF IV and 
ORF V (23 nt). The aligned ORF VI sequences were 1,554 nt in 
length. 

Recombination analyses 

We investigated recombination in the genomic sequences using 
RDP [34], GENECONV [35], BOOTSCAN [36], MAXCHI 
[37], CHIMAERA [38], and SISCAN [39], all implemented in 
RDP4 [40]. We also analyzed the data using the original 
PHYLPRO [41], SISCAN version 2 [39]. These analyses were 
done using default settings for the different detection programs 
and a Bonferroni-corrected P-value cut-off of 0.01, and overlap- 
ping 100- and 50-nt sUces. These analyses also assessed which non- 
recombinant sequences contained regions that were most closely 
related to those of the recombinant sequences, indicating the 
lineages that most likely provided those regions of the recombinant 
genomes. For simplicity, we refer to these as the 'parental isolates' 
of recombinants. To examine the impact of gaps introduced when 
aligning the CaMV sequences to the outgroup, we checked for 
evidence of recombination after aligning the CaMV with the 
outgroup excluded. Finally, the aligned sequences were analyzed 
for recombination using the Recombination Analysis Tool (RAT) 
[42]. This analysis compared the percentage of nucleotide 
similarities using a sliding window of 30 nt, allowing detection of 
breakpoints among sequences. 

We included the recombinant genomes in our analyses of 
individual ORFs when there was no evidence of within-ORF 
recombination, but discarded recombinant genomes for our 
phylogenetic estimates of rates and timescales. Moreover, we 
discarded 192 nt and 93 nt of the 5' and 3' ends, respectively, 
from the aligned ORF VI 1,554 nt. Specifically, we discarded both 
of the ends from the major recombination sites that were found, 
and used 1 ,269 nt for the subsequent ORF VI analyses. 
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Figure 1. Multidimensional scaling of tree-to-tree patristic 
distances. ORF I vs ORF III isolates (A); ORF I vs ORF VI isolates (B); ORFs 
l-V vs ORF VI isolates (C); ORFs l-V Group A vs ORF VI isolates (D); and 
ORFs l-V Group B vs ORF VI isolates (E). 
doi:10.1 371/journal.pone.0085641 .gOOl 

Estimation of substitution rates and divergence times 

The phylogenetic relationships of tlie sequences and of tlieir 
constituent ORFs were estimated using the Neighbor-Net method 
in SPLITSTREE v4.1 1.3 [43], and using maximum likelihood in 
PhyML v3 [44]. For the maximum-likelihood (ML) analysis, we 
used the general time-reversible (GTR) model of nucleotide 
substitution, with rate variation among sites modelled using a 
gamma distribution and a proportion of invariable sites 
(GTR+r4+I). This model was selected in R [45] using the 
Bayesian information criterion, which has been shown to perform 
well in a variety of scenarios [46] . Branch support was evaluated 
by bootstrap analysis based on 1000 pseudoreplicates. The 
maximum-likelihood trees were compared using PATRISTIC 
[47]. 

We performed Bayesian phylogenetic analyses in BEAST vl.7.5 
[48] to estimate the evolutionary rate and timescale of CaMV. 
The sampling times of the sequences were used as calibrations for 
the molecular clock. We used Bayes factors to select the best-fitting 
molecular-clock model and coalescent prior for the tree topology 
and node times. We compared strict and relaxed (uncorrelated 
exponential and uncorrelated lognormal) molecular clocks [49] 
and compared five demographic models (constant population size, 
expansion growth, exponential growth, logistic growth, and the 
Bayesian skyhne plot). We also tested for clocklike evolution using 
a regression of root-to-tip distances on viral sampling times in the 
software Path-O-Gen vl.3 (http://tree.bio.ed.ac.uk/software/ 
pathogen). 

Posterior distributions of parameters, including the tree, were 
estimated by Markov Chain Monte Carlo (MCMC) samphng. 
Samples were drawn every 10* MCMC steps over a total of lO" 
steps, with the first 10% of samples discarded as burn-in. 
Acceptable sampling from the posterior and convergence to the 
stationary distribution were checked using the diagnostic software 
Tracer vl.5 (http://tree.bio.ed.ac.uk/software/tracer/). 

To estimate substitution rates and divergence times from 
heterochronous sequence data, the sampling times need to have 
a sufficient spread in relation to the substitution rate [50]. We 
investigated the temporal structure in our data sets by comparing 
our rate estimates with those from ten date-randomized replicates. 
A data set was considered to have sufficient temporal structure 
when the mean rate estimate from the original data set was not 
contained in any of the 95% credibility intervals of the rates 
estimated from the date-randomized replicates. This follows the 
approach taken in previous studies [51], [52]. 

The spatial population dynamics of CaMV through time were 
inferred in BEAST using a diffusion model with discrete location 
states [48] . This approach uses an explicit model that describes the 
migration of CaMV lineages throughout their evolutionary 
history. The most important pairwise diffusions can be identified 
using Bayes factors [53]. Using SPREAD [54] and Google Earth 
(http://www.google.com/earth), we produced a graphical anima- 
tion of the estimated spatio-temporal movements of CaMV 
lineages. 

Demographic analyses 

DnaSP v5.0 [55] was used to estimate haplotype and nucleotide 
diversities. Haplotype diversity refers to the frequency and number 
of haplotypes in the population. Nucleotide diversity estimates the 
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^Recombination sites detected in the CaMV genomes by the recombination detection programs (listed in column 6), from the aligned sequences of the likely 
recombinant and its 'parental isolates'. The nucleotide position shows locations of Individual genes numbered as in Xinjing genome (AF140604). UD; Undetermined. 
"Recombinant Isolates identified by the recombination detection programs: R (RDP), G (GENECONV), B (BOOTSCAN), M (MAXCHI), C (CHIMAERA) and Sr (SISCAN) 
programs In RDP4, and So (SISCAN total nucleotide site analysis) In original SISCAN version 2 and P (PHYLPRO) programs. The analyses were done using default settings 
and a Bonferroni-corrected P-value cut-off of 0.01 in RDP4. 

*^The reported P-value Is for the program in bold type and underlined in RDP4 and is the smallest P-value among the isolates calculated for the region in question. P- 
values smaller than 1.0x10"= are listed. 
doi:1 0.1 371 /journal.pone.0085641 .tOOl 



average pairwise difTerences among sequences. Nonsynomynious 
(dN) and synonymous (dS) substitution (dN/dS) ratios were 
calculated for seven ORFs using the Pamilo-Bianchi-Li (PBL) 
method in MEGA v5 [56]. The program Structure v2.3.4 [57] was 
used to test for evidence of genetic structure among subpopula- 
tions and to identify individuals that were admixed or had 
migrated among populations. To select the number of clusters that 
best represented population structure, we performed analyses with 
1 to 10 subpopulations {K= 1 to 10), sampling from 10*' Markov 
chain steps after a burn-in of 10' steps. We identified the 
maximum delta- A' value to determine the best-supported number 
of subdivisions in the populations [58]. 

Results 

Biological characteristics of the CaMV isolates 

A total of more than 1000 samples collected during the 2001- 
2010 growing seasons in Greece, Iran, Japan, and Turkey were 
tested by DAS-ELISA. About 70 plants of B. napus (oUseed rape). 



B. oleracea (cabbage), B. oleracea var. italica (broccoli), B. oleracea var. 
botrytis (cauliflower), R. sativus (radish) and other brassicas were 
found to be infected with CaMV. The viruses were found in 
commercial fields as well as in home gardens. 

Bras.sica and Raphanus plants were systemically infected by most 
isolates. Although they had very minor differences in pathogenicity 
in Brassica and Raphanus plants, we concluded that most isolates 
were of a similar pathotype. In contrast, three [fPNN, JPNSl, and 
JPNS2) of the ten isolates collected in Japan showed very faint 
symptoms in both Brassica and Raphanus plants, and we call these 
attenuated isolates. 

Genome sequences 

The complete genomes of 67 CaMV isolates were sequenced in 
the present study. The genomes of Eurasian isolates determined in 
the present study were 7984—8063 nt in length, with ORF lengths 
of 978-984 nt (ORF I), 459-480 nt (ORF II), 390 nt (ORF III), 
1458-1512 nt (ORF IV), 2025-2040 nt (ORF V), 1560-1575 nt 
(ORF VI), and 285-291 nt (ORF VII). Furthermore, the large 
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Figure 2. Cluster-based analysis of population subdivision using Structure. The results are grouped by population of origin for each 
individual. Each individual is represented by a column. The number of clusters is indicated by the value of K: ORFs l-V, K=6 (A), ORE VI, /(= 5 (B). The 
colour proportion for each bar represents the posterior probability of assignment of each individual to one of six clusters (A) and one of five clusters 
(B) of genetic similarity. Clusterings correspond to those shown in Figure SI in File SI. 
doi:1 0.1 371 /journal.pone.0085641 .g002 



intergenic regions located between ORFs VI and VII were 704— 
784 nt in length, whereas the small intergenic regions located 
between ORFs V and VI were 103-104 nt in length. All of the 
motifs reported for different caulimovirus-encoded proteins were 
found. The new genomic sequences determined in this study are 
available in DDBJ/EMBL/GenBank databases with accession 
codes AB863136-AB863202. 

Patristic distance plots 

We made pairwise comparisons of the maximum-likelihood 
trees of the individual ORFs using PATRISTIC. All pairwise plots 
of the distances in the trees inferred from the ORFs I, II, III, and 
IV gave similar patterns. This is illustrated by the plot of ORF I 
against ORF III distances (Figure lA), in which the two sets of 
distances had a linear correlation coefficient of 0.516 (p<0.001). 
The plots of the ORF V distances against those of ORF I to IV 
showed that ORF V might have two slightly different but 
overlapping populations of distances (data not shown). By contrast, 
plots of the ORF VI distances against those of ORFs I-V, either 
individually (Figure IB) or concatenated (Figure IC), showed that 
there were two completely distinct lineages of ORF VI, and these 



were distinct from those in ORF V. Furthermore, plots of the 
Group A and Group B ORF VI distances against those of the 
concatenated ORFs I-V (Figure ID and E) showed that the two 
sublineages were distinct. The patristic distances of the ORF VII 
tree gave much more complex patterns when plotted against those 
of the other ORFs. However, because ORF VII is much shorter 
than the other ORFs, it is possible that this apparent complexity is 
an artefact of sampling. Overall, the PATRISTIC plots supported 
concatenation of ORFs I-V for subsequent evolutionary analyses. 
We analyzed ORF VI separately and omitted ORF VII from our 
analyses. 

Recombination analyses 

Clear evidence of non-tree like evolution was indicated by the 
SplitsTree analyses (Figure SI in File SI). These indicated that 
there might be recombinant regions in both ORFs I-V and ORF 
VI. We analyzed the protein-encoding gene sequences of 67 
CaMV isolates and nine published sequences for evidence of 
recombination. Many clear recombination sites were detected 
throughout the CaMV genomes (Table 1, Figure S2 in File SI). 
Sites were found at 5' and 3' sequences of ORF VI at nt 5996 in 
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Table 2. Details of the data sets used for estimation of nucleotide substitution rate and time to the most recent common ancestor 
for Cauliflower mosaic virus. 





Parameter 


Open reading frame 






l-V 


VI 


Best-fit substitution model 


GTR-l-l+r4 




Best-fit molecular clock model 


Relaxed Uncorrelated Exponential 


Relaxed Uncorrelated Exponential 


Best-fit population growth model 


Exponential growth 


Constant size 


Sequence length (nt) 


5106 


1269 


No. of sequences 


66 


97 


Sampling date range 


1960-2010 


1960-2012 


Chain length (in millions) 


100 


100 


TMRCA" (years) 


491 (86-1270) 


431 (113-886) 


Substitution rate (nt/site/year) 


1 .71 X 1 0"" (1 .45 X 1 0^^-3.87 x 1 0"*) 


5.81 xlO"" (2.47x10"''-9.47x10"'') 


dN/dS" 


0.069 


0.201 


No. of variable sites 


1074 


448 



^Time to the most recent common ancestor. 

"^Nonsynomymous (dN) and synonymous (dS) substitution (dN/dS) ratios were calculated for seven ORFs using the Pamilo-Bianchi-Li (PBL) method in MEGA v5 [56]. 
doi:l 0.1 371 /journal.pone.0085641 .t002 



the genomes of isolates from Iran and Japan, and at nt 7348 in 
Greek isolates. Some recombination sites were found in other 
Turkish genomes, but many were not statistically significant. 

Phylogenetic analyses 

Networks and phylogenetic trees were inferred from concate- 
nated ORFs I-V and from ORF VI. The network inferred from 
ORFs I-V had short internal branches (Figure SI A in File SI). In 
contrast, the ORF VI sequences showed two major lineages of 
CaMV separated by long branches (Figure SIB and Figure S3 in 
File SI). Each of the subgroups in ORFs I-V and ORF VI 
contains isolates collected in a geographically confined area. 

The major differences between trees from ORFs I-V and from 
ORF VI are found in the relationships among the subgroups, not 
the subgroup membership. The maximum-likelihood bootstrap- 
ping analysis showed strong support for the various nodes in the 
ORF VI tree (as in Figure S3 in File SI). In contrast, the tree from 
ORFs I-V only had strong support at the subgroup level, with the 
basal nodes having support values below 30%. ORFs I-V and 
ORF VI yielded maximum-likelihood trees with very different 
relative branch lengths. In the ORF VI tree, the two basal 
branches span two-thirds of the mean root-to-tip distance of the 
tree, compared with only one-tenth in the ORFs I-V tree. 

The ORF VI tree partitions most of the sequences into two 
major groups: Group A consists of Iranian and Japanese/North 
American/European subgroups, and Group B consists of Greek, 
Turkish and Iranian subgroups. Although most of the isolates from 
each country were placed into a single subgroup, those of Iran fell 
into two. Interestingly, the Iran II isolates that clustered with 
Turkish isolates in Group B came from the Khorasan Razavi 
district (see Table SI in File SI), which is in north-eastern Iran and 
is not adjacent to Turkey. The topology of Group B showed a 
geographically hierarchical pattern of evolution, with the Turkish 
population diverging from the Greek population, and the Iranian 
population diverging from the Turkish population. 



The nucleotide diversity of ORF VI from the Japanese samples in 
Group A was greater (0.03849) than those of Iran and USA, 
whereas greater diversity was found in the Greek samples in 
Group B (although only a small number of Greek isolates were 
used for these calculations). Nucleotide diversity was highest in 
Iran (0.06934). In ORFs I-V, nucleotide diversity was higher in 
Turkey (0.02776) than in Greece, Iran, or Japan. In estimating 
these genetic difierences, we assumed that the population of each 
country evolved independendy, although the sampling area in 
each country might influence our estimates. 

The cluster-based method implemented in Structure was used 
to identify individuals that were admixed or had migrated among 
brassica-infecting CaMV populations. Our analysis supported six 
subpopulations in ORFs I-V (Figure 2A) and five in ORF VI 
(Figure 2B). Many individuals contain substantial numbers of 
nucleotide polymorphisms that are apparently characteristic of 
ORFs I-V subpopulations, that are colour-coded in Figure 2. The 
Japan/USA/Europe cluster consisted of yellow, red, and dark 
pink subpopulations, and the Japanese isolates seemed to be 
divided into two subpopulations. On the other hand, the Iranian 
cluster consisted of yellow, green, and blue subpopulations, with 
the last two being dominant. Turkish clusters consisted of yellow, 
light pink, and green populations, with the Kght pink subpopu- 
lation being predominant. All of the clusters included the yellow 
subpopulation, and this might be ancestral the ancestral popula- 
tion. Most individual clusters have a predominant subpopulation 
in ORF VI (Figure 2B). The major subpopulations of Japan, Iran, 
and Turkey were red/dark pink, blue, and green, respectively. 
The Bari 1 isolate was part of the yellow subpopulation, which 
might be the ancestral isolate of the CaMV subpopulation seen in 
the Neighbor-Net tree (Figure SIB in File SI). Although the 
proportion of the yeUow subpopulation was small in all clusters, 
the subpopulation was admixed with other individuals in all 
clusters. Our results suggest that CaMV became geographically 
segregated, but with frequent spread between regions. 



Genetic population structure 

We compared the haplotype and nucleotide diversities of 
CaMV populations and subpopulations in each country (data not 
shown). The haplotype diversity in most groups exceeded 0.95. 



Evolutionary rates and timescales 

We used a Bayesian phylogenetic method to estimate the 
evolutionary rates and timescales for the individual genomic 
regions. Based on the results of our PATRISTIC analyses, we 
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Figure 3. Bayesian phylogenetic estimates from ORFs l-V and ORF VI of Cauliflower mosaic virus. Maxlmum-clade-credlbility trees from 
BEAST analyses of 66 and 97 isolates of ORFs l-V (A) and ORF VI (B), respectively. Branch colours correspond to the most probable geographic 
location of their descendent nodes. 
doi:10.1371/journal.pone.0085641.g003 



analyzed a concatenated alignment of ORFs I-V and a separate 
alignment of ORF VI. The best-supported demographic models 
were exponential growth for ORFs I-V and constant size for ORF 
VI (Table S2 in File SI). For both data sets, a relaxed-clock model 
provided a better fit than the strict-clock model (Table 2). To 
determine whether there was temporal structure in the ORFs I-V 
and ORF VI data sets, we fitted a linear regression between 
collection date and the root-to-tip genetic divergence using Path- 
O-Gen vl.3 (Figure S4 in File SI). For ORFs I-V and ORF VI, 
we obtained respective i?-squared values of —0.20 1 and 0. 160, and 
respective P-values of 0.104 and 0.119. These results indicate that 
the relationship between collection date and sampling time is not 
significant, so the molecular clock hypothesis is rejected for these 
data sets. 

Nonetheless our analyses of date-randomized replicates revealed 
that the sampling times of ORFs I-V and ORF VI had sufficient 
temporal structure for calibration of the molecular clock (Figure S5 
in File SI). This was indicated by the smaller 95% credibility 
intervals of the rate estimates from the original data set compared 
with the date-randomized replicates. In addition, the mean 
posterior rate estimates from the original data were not contained 
with the 95% credibility intervals of the rate estimates from the date- 
randomized replicates. The mean estimated substitution rates were 
1.71x10"* subs/site/year for ORFs I-V and 5.81 x 10"* subs/site/ 
year for ORF VI (Table 2). Estimates of the age of the root were 49 1 
years for ORFs I-V and 43 1 years for ORF VI (Table 2, Figure 3). 

Patterns of viral migration 

Our Bayesian phylogenetic analysis of the origin and global 
spread of CaMV showed strong Bayes factor (BF) support from 
ORFs I-V hat the virus had spread from Turkey to Greece 
(BF = 205) and to Iran (BF = 61) (Figure 4). There was also some 
support for spread from Turkey to Japan (BF= 14). The ORF VI 



data supported spread from Greece to Turkey (BF = 230) and to 
Iran (BF= 128), and from Japan to USA (BF= 112). 

Discussion 

We aimed to understand the migration dynamics and spread of 
CaMV in their natural hosts by utilizing over 50 years of 
surveillance data. Our analyses show that the samples from 
Europe, Japan, Middle East and USA, including the regions where 
various Brassicaceae were first domesticated, seems to have 
captured a significant sample of the global genetic diversity of 
CaMV. The presence of as-yet-uncoUected CaMV infecting 
diflFerent non-brassica plant species may have biased our analysis 
against the detection of heterotopic processes. We recentiy 
presented a similar case study for TuMV evolution using wild 
orchid and brassica isolates [9]. 

Our comparisons of the ML trees of the individual ORFs using 
PATRISTIC showed that the ORFs I-V shared similar evolu- 
tionary histories, and this was diflFerent from that of ORF VI 
(Figure 1). The ORF I-V proteins are expressed from 35S RNA, 
whereas ORF VI protein is from 19S RNA. ORF VT protein is the 
major component of cytoplasmic inclusion bodies and the 
structures called viroplasms, which are thought to be 'virion 
factories'. Additionally, this protein is an essential determinant of 
host range, affects symptom severity [20], and is known to 
transactivate the translation of ORFs I-V from the polycistronic 
35S protein [20], [59]. Interestingly, attenuated isolates of three 
Japanese JPNN, JPNSl, and JPNS2 were found in the present 
study, and the isolates grouped together in the ORF VI tree 
(Figure SIB in File SI). 

Recombination is an important source of genetic variation not 
only for CaMV [30,60] but also for many other plant viruses [3], 
[13], [61-63]. We report several phylogenetic patterns that might 
have resulted from recombination in CaMV and that have not 




Figure 4. Patterns of Cauliflower mosaic i/vri/s migration jointly estimated across tlie two ORF regions. ORFs l-V and ORF VI migrations are 
shown by solid and dashed lines. Lines connecting discrete regions indicate statistically supported ancestral state changes and their thicknesses 
denote statistical support. There are five categories of support. In increasing order, line thicknesses indicate 6<BF<10 (positive support); lOsBFOO 
(strong support); 30<BF<100 (very strong support); and BFalOO (decisive support). Migration line was not shown when they were represented by 
only a single sample. 
doi:1 0.1 371/journal.pone.0085641 .g004 
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previously been found in tlie isolates from North America [30]. 
Additionally, although recombination sites have not been found in 
the ORF VI region [64], we found that many isolates from 
Europe, Iran, Japan and USA isolates were recombinants, with 
sites located at the 5' and 3' ends of ORF VI (Table 1, Figure S2 
in File SI). Our results suggest that these two sites are 
recombination hot spots in CaMV. The recombination hot spot 
at the 5' end in ORF VI is located in the middle of reported 
virulence /avirulence [65] and pathogenicity domains [59], [66]. 
The present geographical distributions of the various CaMV 
recombinant lincagc-s imply that there have been complex patterns 
of CaMV movement throughout the world. 

Our estimates of the genetic population structure have shown 
that there has been frequent spread between regions (Figure 2). 
However, the structure of ORF VT (Figure 2B) showed clear 
geographical segregation at the primary divergence of the CaMV 
population, which was not shown by ORFs I-V (Figure 2A). The 
same divergences were shown by the Neighbor-Net trees of the 
same data (Figure S 1 in File SI). Our Bayesian phylogenetic analysis 
revealed that ORFs I-V and ORF VI support different local 
migration patterns for CaMV. For instance, ORFs I-V showed that 
CaMV migrated from Turkey to Greece and Iran, whereas ORF 
VI data set showed that the virus from Greece and then spread to 
Turkey or Iran. This suggests that there was insufficient phyloge- 
netic signal to reveal unequivocally the complex patterns of 
migration in the CaMV populations in the past. The Neighbour- 
Net tree (Figure SIB in File SI) was estimated from ORF VI 
sequences that included one from the Italian Baril isolate. The 
position of this isolate in the ORF VI tree suggests that there might 
be a third distinct CaMV population that is yet to be sampled and 
sequenced. The different migration patterns in different regions 
might reflect characteristics of CaMV transmission and geograph- 
ical barriers. CaMV is transmitted by aphids in a semi-persistent 
manner and they are able to only carry the virus for a short time. 
Mountains, deserts, country-dependent agriculture crops and 
growing conditions of crops may present obstacles to the spread 
of aphids, thus limiting the spread of the virus. Physical obstacles 
have also been reported to be responsible for the strain localization 
oi Rice yellow mottle vims [67] and Tobacco vein banding mosaic vims [68] . 

CaMV mainly infects brassica crops, including cabbage, 
broccoli and cauliflower. Non-heading cabbages and kale were 
probably domesticated before 1000 BC in Eurasia [69], but were 
not taken to North America and Japan until the 1 7th and 1 9th 
centuries respectively. Broccoli and perhaps cauliflower originated 
from kale, and first appeared in the east Mediterranean. Broccoli 
and cauliflower spread from Italy to other European countries 
around the 16th to 19th centuries, prior to their introduction into 
North America and Japan in 19th to 20th centuries [70], [71]. 
Our estimate of the divergences in the tree of ORF VI shows that 
the primary divergence was around 450 years ago, but the 
divergences of the subgroup lineages occurred about 100-200 
years ago (Figure 3B). Thus our well-supported estimate of the 
time to the most recent common ancestor of CaMV lineages based 
on the ORFs VI sequences is consistent with the global trade in 
broccoli, cauliflower and other brassica species grown as 
antiscorbutics, from Europe to other parts of the world. This 
timing also suggests that aphids were not responsible for the 
primary global spread of CaMV. Further global sampling of 
CaMV isolates is needed to confirm these results and the 
discrepancy between the topologies of the ORF I-V and ORF 
VI trees, nonetheless the age of the ancestor of CaMV fits neady 
with the timescale of migration of brassica crops across the world. 

We have interpreted our results while assuming that CaMV has 
evolved in a straightforward manner. We have concluded that the 



apparent difference in phylogeny between the ORFs I-V and ORF 
VI genes results from an inadequate phylogenetic signal in ORFs I- 
V, as shown by the lack of bootstrap support for the basal nodes of 
trees estimated from those sequences. However it is important to 
note that the evolution of CaMV, a pararetrowus, may be unusual. 
CaMV has an unusually high recombination rate [60], and its 
populations have very large effective sizes [72]. Another paraere- 
trovirus. Banana streak vims, exists as both a virus and as endogenous 
elements integrated within the host genome with, probably, 
completely different evolutionary rates [73]. It is also noteworthy 
that the 35S promoter that is widely used in transgenic plant 
research includes much of ORF VI [74]. Thus, the unexpected 
should be expected in studies of the molecular phylogenetics of 
caulimovirids, not only in the gene sequences themselves but also in 
their behavior in the methods used to analyze them. 

In conclusion, our study has shown that (i) recombination is 
common in CaMV; (ii) ORFs I-V and ORF VI of its genome 
show different evolutionary patterns; (iii) the ORFs are evolving at 
a rate in the range of 1.71-5.81x10 * substitutions/site/year, 
which is similar to that of RNA and ssDNA viruses; (iv) ORF VI is 
the most rapidly evolving ORF; (v) there is evidence of at least four 
geographically confined lineages of CaMV; (vi) CaMV probably 
spread from a single population to other parts of the world around 
400-500 years ago; (vii) CaMV is widely distributed in Eurasian 
countries; and (viii) there is evidence of frequent spread between 
Turkey and neighboring countries, and similarly between Japan 
and the USA. This is the first report on the spatial and temporal 
spread of a plant pararetrovirus. 

Supporting Information 

File SI Figures S1-S5 & Tables S1-S2. 

Figure SI. Phylogenetic evidence for recombination among 
Cauliflower mosaic virus from the Europe, Japan, Middle East (Iran 
and Turkey) and USA. ORFs I-V (A) and ORF VI (B). Neighbor- 
Net network analysis was performed using SplitsTree4. Horseradish 
latent virus is used as the outgroup. Formation of a reticular network 
rather than a single bifurcated tree is suggestive of recombination. 
The isolates obtained in this study are listed in Table S 1 in File S 1 . 
(PDF) 

Figure S2. Recombination analysis by RAT plot. Each blue line 
represents a pairwise sequence comparison. The red curve 
represents the estimated proportion of recombinants at each 
position in the alignment. The red vertical lines denote estimated 
positions of recombination breakpoints, which approximately match 
the boundaries of the ORF VI region. The estimated nucleotide 
positions of the recombination sites are shown relative to the 5' end 
of the genome, using numbering of the gapped aligned sequences 
with gaps removed (see Materials and methods). Recombination 
sites in parentheses are shown relative to the 5 ' end of the genome 
using numbering of the sequence of the Xinjing isolate. 
(PDF) 

Figure S3. Maximum-likelihood tree estimated from ORF VI of 
105 non-recombinant Cauliflower mosaic virus isolates. Nodes are 
labelled with bootstrap support percentages. 
(PDF) 

Figure S4. Regression of root-to-tip distance (inferred from 
Maximum-likeUhood trees) against year of isolation for the gene 
with the smallest number of sequences in each ORF region. 

(PDF) 

Figure S5. Estimates of nucleotide substitution rates. Mean 
estimates and 95% credibility intervals are shown. These were 
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estimated from 66 ORFs I-V and 97 ORF VI (see text). In each 
set of estimates, the first is based on the original data, whereas the 
remaining ten values are from date-randomized replicates. The 
95% credibility intervals of the estimates from the date- 
randomized replicates do not overlap with the mean posterior 
estimate from the original data set. In addition, the lower tails of 
the credibility intervals are long and tend towards zero. These 
features suggest that there is sufficient temporal structure in the 
original data sets for rate estimation. 
(PDF) 

Table SI. Cauliflower mosaic vims isolates analyzed in this study. 
(DOC). 

Table S2. Detailed results from BEAST analyses of Cauliflower 

mosaic virus. 

(DOC) 
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